MANSY: Generalizing Neural Adaptive Immersive Video Streaming With Ensemble and Representation Learning
The popularity of immersive videos has prompted extensive research into
neural adaptive tile-based streaming to optimize video transmission over
networks with limited bandwidth. However, the diversity of users' viewing
patterns and Quality of Experience (QoE) preferences has not been fully
addressed yet by existing neural adaptive approaches for viewport prediction
and bitrate selection. Their performance can significantly deteriorate when
users' actual viewing patterns and QoE preferences differ considerably from
those observed during the training phase, resulting in poor generalization. In
this paper, we propose MANSY, a novel streaming system that embraces user
diversity to improve generalization. Specifically, to accommodate users'
diverse viewing patterns, we design a Transformer-based viewport prediction
model with an efficient multi-viewport trajectory input-output architecture
based on implicit ensemble learning. In addition, we are the first to combine
representation learning with deep reinforcement learning to train the
bitrate selection model to maximize diverse QoE objectives, enabling the model
to generalize across users with diverse preferences. Extensive experiments
demonstrate that MANSY outperforms state-of-the-art approaches in viewport
prediction accuracy and QoE improvement on both trained and unseen viewing
patterns and QoE preferences, achieving better generalization.
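To make the multi-viewport input-output idea concrete, the following is a minimal sketch (not the authors' implementation) of a Transformer encoder that reads several viewport trajectories at once and predicts a future trajectory for each ensemble member, in a MIMO-style implicit-ensembling spirit; the class and parameter names, the (yaw, pitch) viewport encoding, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): a Transformer encoder that
# reads several viewport trajectories at once and predicts a future trajectory
# for each ensemble member, in a MIMO-style implicit-ensembling spirit.
import torch
import torch.nn as nn

class MultiViewportTransformer(nn.Module):
    def __init__(self, num_members=3, pred_len=5, d_model=128):
        super().__init__()
        self.num_members = num_members
        self.pred_len = pred_len
        # Each time step carries an assumed (yaw, pitch) pair per member.
        self.embed = nn.Linear(2 * num_members, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2 * num_members * pred_len)

    def forward(self, trajectories):
        # trajectories: (batch, num_members, hist_len, 2)
        b, m, t, _ = trajectories.shape
        x = trajectories.permute(0, 2, 1, 3).reshape(b, t, m * 2)
        h = self.encoder(self.embed(x))
        out = self.head(h[:, -1])                 # summarize with the last token
        return out.view(b, m, self.pred_len, 2)   # one future trajectory per member

model = MultiViewportTransformer()
dummy = torch.randn(4, 3, 10, 2)                  # 4 users, 3 members, 10 past steps
print(model(dummy).shape)                         # torch.Size([4, 3, 5, 2])
```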
Deep Learning for Edge Computing Applications: A State-of-the-Art Survey
With the booming development of the Internet of Things (IoT) and communication technologies such as 5G, our future world is envisioned as an interconnected entity where billions of devices will provide uninterrupted service to our daily lives and to industry. Meanwhile, these devices will generate massive amounts of valuable data at the network edge, calling for not only instant data processing but also intelligent data analysis in order to fully unleash the potential of edge big data. Neither traditional cloud computing nor on-device computing can sufficiently address this problem, due to high latency and limited computation capacity, respectively. Fortunately, the emerging edge computing paradigm sheds light on the issue by pushing data processing from the remote network core to the local network edge, remarkably reducing latency and improving efficiency. Besides, recent breakthroughs in deep learning have greatly enhanced data processing capabilities, enabling the development of novel applications such as video surveillance and autonomous driving. The convergence of edge computing and deep learning is believed to bring new possibilities to both interdisciplinary research and industrial applications. In this article, we provide a comprehensive survey of the latest efforts on deep-learning-enabled edge computing applications and offer insights on how to leverage deep learning advances to facilitate edge applications in four domains, i.e., smart multimedia, smart transportation, smart city, and smart industry. We also highlight the key research challenges and promising research directions therein. We believe this survey will inspire more research and contributions in this promising field.
From Capture to Display: A Survey on Volumetric Video
Volumetric video, which offers immersive viewing experiences, is gaining
increasing prominence. With its six degrees of freedom, it provides viewers
with greater immersion and interactivity compared to traditional videos.
Despite their potential, volumetric video services pose significant
challenges. This survey conducts a comprehensive review of the existing
literature on volumetric video. We first provide a general framework of
volumetric video services, followed by a discussion on prerequisites for
volumetric video, encompassing representations, open datasets, and quality
assessment metrics. Then we delve into the current methodologies for each stage
of the volumetric video service pipeline, detailing capturing, compression,
transmission, rendering, and display techniques. Lastly, we explore various
applications enabled by this pioneering technology and we present an array of
research challenges and opportunities in the domain of volumetric video
services. This survey aspires to provide a holistic understanding of this
burgeoning field and shed light on potential future research trajectories,
aiming to bring the vision of volumetric video to fruition.
Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation
Multimodal learning has seen great success in mining data features from
multiple modalities, with remarkable improvements in model performance.
Meanwhile, federated learning (FL) addresses the data sharing problem,
enabling privacy-preserving collaborative training that provides sufficient
precious data. Great potential therefore arises from their confluence, known
as multimodal federated learning. However, predominant approaches are limited
in that they often assume each local dataset records samples from all
modalities. In this
paper, we aim to bridge this gap by proposing an Unimodal Training - Multimodal
Prediction (UTMP) framework under the context of multimodal federated learning.
We design HA-Fedformer, a novel transformer-based model that empowers unimodal
training with only a unimodal dataset at the client and multimodal testing by
aggregating multiple clients' knowledge for better accuracy. The key advantages
are twofold. Firstly, to alleviate the impact of non-IID data, we develop an
uncertainty-aware aggregation method for the local encoders with layer-wise
Markov Chain Monte Carlo sampling. Secondly, to overcome the challenge of
unaligned language sequences, we implement a cross-modal decoder aggregation to
capture the hidden signal correlation between decoders trained by data from
different modalities. Our experiments on popular sentiment analysis benchmarks,
CMU-MOSI and CMU-MOSEI, demonstrate that HA-Fedformer significantly outperforms
state-of-the-art multimodal models under the UTMP federated learning
framework, with a 15%-20% improvement on most attributes.
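As a rough illustration of how uncertainty could steer server-side aggregation, here is a hedged sketch (not HA-Fedformer's actual procedure): each client is assumed to report several parameter samples per encoder layer (e.g., drawn by MCMC-style sampling), and the server averages clients layer by layer with weights inversely proportional to their sample variance. The function name, the sample format, and the 1e-8 stabilizer are all assumptions.

```python
# Hedged sketch (not the paper's procedure): each client reports several
# parameter samples per encoder layer (e.g. from MCMC-style sampling); the
# server averages clients layer by layer, weighting by inverse sample variance.
import torch

def uncertainty_weighted_aggregate(client_samples):
    """client_samples: list over clients; each entry maps layer name ->
    tensor of shape (num_samples, *param_shape) sampled on that client."""
    aggregated = {}
    for name in client_samples[0]:
        means, weights = [], []
        for samples in client_samples:
            s = samples[name]
            means.append(s.mean(dim=0))
            # Lower variance across samples -> more trusted -> larger weight.
            weights.append(1.0 / (s.var(dim=0).mean().item() + 1e-8))
        w = torch.tensor(weights)
        w = w / w.sum()
        aggregated[name] = sum(wi * mi for wi, mi in zip(w, means))
    return aggregated

# Toy usage: two clients, one shared encoder layer, five samples each.
clients = [
    {"encoder.weight": torch.randn(5, 4, 4) * 0.1},  # low-uncertainty client
    {"encoder.weight": torch.randn(5, 4, 4) * 1.0},  # high-uncertainty client
]
print(uncertainty_weighted_aggregate(clients)["encoder.weight"].shape)  # torch.Size([4, 4])
```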
ILCAS: Imitation Learning-Based Configuration-Adaptive Streaming for Live Video Analytics with Cross-Camera Collaboration
High-accuracy yet resource-intensive deep neural networks (DNNs) have
been widely adopted by live video analytics (VA), where camera videos are
streamed over the network to resource-rich edge/cloud servers for DNN
inference. Common video encoding configurations (e.g., resolution and frame
rate) have been identified as having significant impacts on the balance
between bandwidth consumption and inference accuracy, and therefore their
adaptation scheme has been a focus of optimization. However, previous
profiling-based solutions suffer from high profiling cost, while existing deep
reinforcement learning (DRL) based solutions may achieve poor performance due
to the use of a fixed reward function for training the agent, which fails to
capture the application goals in various scenarios. In this paper, we propose
ILCAS, the first imitation learning (IL) based configuration-adaptive VA
streaming system. Unlike DRL-based solutions, ILCAS trains the agent with
demonstrations collected from an expert, which is designed as an offline
optimal policy that solves the configuration adaptation problem through dynamic
programming. To tackle the challenge of video content dynamics, ILCAS derives
motion feature maps based on motion vectors, which allow ILCAS to visually
"perceive" video content changes. Moreover, ILCAS incorporates a cross-camera
collaboration scheme to exploit the spatio-temporal correlations of cameras for
more proper configuration selection. Extensive experiments confirm the
superiority of ILCAS compared with state-of-the-art solutions, with 2-20.9%
improvement of mean accuracy and 19.9-85.3% reduction of chunk upload lag.
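To illustrate what an offline dynamic-programming "expert" for configuration adaptation might look like, the sketch below (an assumption-laden toy, not the ILCAS expert) selects one encoding configuration per chunk from known accuracy and bandwidth profiles, maximizing accuracy minus a bandwidth penalty while discouraging frequent configuration switches; the reward weights and switch cost are hypothetical.

```python
# Assumption-laden toy (not the ILCAS expert): choose one encoding configuration
# per chunk via dynamic programming, trading accuracy against bandwidth and
# penalizing configuration switches between consecutive chunks.
def dp_expert(accuracy, bandwidth, lam=0.5, switch_cost=0.1):
    """accuracy[t][c], bandwidth[t][c]: per-chunk, per-configuration profiles
    assumed to be known offline. Returns the optimal configuration sequence."""
    T, C = len(accuracy), len(accuracy[0])
    reward = [[accuracy[t][c] - lam * bandwidth[t][c] for c in range(C)]
              for t in range(T)]
    best = [reward[0][:]]             # best[t][c]: best total reward ending in c
    back = [[0] * C]
    for t in range(1, T):
        row, ptr = [], []
        for c in range(C):
            cands = [best[t - 1][p] - (switch_cost if p != c else 0.0)
                     for p in range(C)]
            p_star = max(range(C), key=lambda p: cands[p])
            row.append(cands[p_star] + reward[t][c])
            ptr.append(p_star)
        best.append(row)
        back.append(ptr)
    c = max(range(C), key=lambda c: best[-1][c])   # backtrack the best path
    path = [c]
    for t in range(T - 1, 0, -1):
        c = back[t][c]
        path.append(c)
    return path[::-1]

# Toy profiles: 3 chunks, 2 configurations (e.g. 480p vs. 1080p encodings).
acc = [[0.70, 0.92], [0.72, 0.90], [0.69, 0.91]]
bw  = [[0.20, 0.80], [0.20, 0.85], [0.20, 0.80]]
print(dp_expert(acc, bw))   # [0, 0, 0] with these weights; a smaller lam favours 1080p
```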
Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction
Volumetric video has emerged as an attractive new video paradigm in recent
years since it provides an immersive and interactive 3D viewing experience with
six degrees of freedom (DoF). Unlike traditional 2D or panoramic videos, volumetric
videos require dense point clouds, voxels, meshes, or huge neural models to
depict volumetric scenes, which results in a prohibitively high bandwidth
burden for video delivery. User behavior analysis, especially viewport and
gaze analysis, then plays a significant role in prioritizing the content
streaming within users' viewport and degrading the remaining content to
maximize user QoE with limited bandwidth. Although understanding user behavior
is crucial, to the best of our knowledge, there are no available 3D
volumetric video viewing datasets containing fine-grained user interactivity
features, not to mention further analysis and behavior prediction. In this
paper, we release, for the first time, a volumetric video viewing behavior
dataset with a large scale, multiple dimensions, and diverse conditions. We
conduct an in-depth analysis to understand user behaviors when viewing
volumetric videos. Interesting findings on user viewport, gaze, and motion
preference related to different videos and users are revealed. We finally
design a transformer-based viewport prediction model that fuses the features of
both gaze and motion, which is able to achieve high accuracy under various
conditions. Our prediction model is expected to further benefit volumetric
video streaming optimization. Our dataset, along with the corresponding
visualization tools, is accessible at
https://cuhksz-inml.github.io/user-behavior-in-vv-watching/
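For intuition, here is a minimal sketch of a transformer-based predictor that fuses gaze and motion histories, assuming a 6-DoF pose plus a 3-D gaze direction per time step; this is a hypothetical architecture for illustration, not the released model, and fusion by simple addition of projected features is an assumption.

```python
# Minimal sketch (hypothetical architecture, not the released model): fuse past
# head-motion (6-DoF pose) and gaze-direction sequences with a Transformer
# encoder and regress the next viewport poses.
import torch
import torch.nn as nn

class GazeMotionViewportPredictor(nn.Module):
    def __init__(self, pred_len=10, d_model=64):
        super().__init__()
        self.motion_proj = nn.Linear(6, d_model)   # x, y, z, yaw, pitch, roll
        self.gaze_proj = nn.Linear(3, d_model)     # gaze direction vector
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, pred_len * 6)
        self.pred_len = pred_len

    def forward(self, motion, gaze):
        # motion: (batch, hist, 6), gaze: (batch, hist, 3); fuse by addition.
        x = self.motion_proj(motion) + self.gaze_proj(gaze)
        h = self.encoder(x)[:, -1]                 # summary of the fused history
        return self.head(h).view(-1, self.pred_len, 6)

model = GazeMotionViewportPredictor()
pose_hist = torch.randn(2, 30, 6)
gaze_hist = torch.randn(2, 30, 3)
print(model(pose_hist, gaze_hist).shape)           # torch.Size([2, 10, 6])
```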
LiveVV: Human-Centered Live Volumetric Video Streaming System
Volumetric video has emerged as a prominent medium within the realm of
eXtended Reality (XR) with the advancements in computer graphics and depth
capture hardware. Users can fully immerse themselves in volumetric video with
the ability to move their viewport in six degrees of freedom (DoF), including
three rotational dimensions (yaw, pitch, roll) and three translational
dimensions (X, Y, Z). Different from traditional 2D videos that are composed of
pixel matrices, volumetric videos employ point clouds, meshes, or voxels to
represent a volumetric scene, resulting in significantly larger data sizes.
While previous works have successfully achieved volumetric video streaming in
video-on-demand scenarios, the live streaming of volumetric video remains an
unresolved challenge due to the limited network bandwidth and stringent latency
constraints. In this paper, we propose the first holistic live volumetric video
streaming system, LiveVV, which achieves multi-view capture, scene segmentation
& reuse, adaptive transmission, and rendering. LiveVV
contains multiple lightweight volumetric video capture modules that are capable
of being deployed without prior preparation. To reduce bandwidth consumption,
LiveVV processes static and dynamic volumetric content separately by reusing
static data with low disparity and decimating data with low visual saliency.
Besides, to deal with network fluctuation, LiveVV integrates a volumetric video
adaptive bitrate streaming algorithm (VABR) to enable fluent playback with the
maximum quality of experience. Extensive real-world experiments show that
LiveVV can achieve live volumetric video streaming at a frame rate of 24 fps
with a latency of less than 350 ms.
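As a simple illustration of latency-budgeted adaptive bitrate selection for live volumetric content, the sketch below (assumed logic, not the actual VABR algorithm) picks the highest quality level whose estimated download time, after what the playback buffer can absorb, stays within a live-latency budget; the function name, the buffer model, and the 350 ms default budget (echoing the latency reported above) are assumptions.

```python
# Hedged sketch (assumed logic, not the actual VABR algorithm): pick the highest
# quality level for the next volumetric frame group whose estimated transfer
# time still fits the live-latency budget given the current throughput estimate.
def select_quality(level_sizes_mb, throughput_mbps, buffer_s, latency_budget_s=0.35):
    """level_sizes_mb: candidate encodings of a frame group, ordered low to high quality."""
    best = 0
    for level, size_mb in enumerate(level_sizes_mb):
        download_s = size_mb * 8.0 / max(throughput_mbps, 1e-6)
        # Keep the expected delay (download time minus what the buffer absorbs)
        # within the live-latency budget; later levels are higher quality.
        if download_s - buffer_s <= latency_budget_s:
            best = level
    return best

# Toy usage: three encodings of a frame group, a 40 Mbps link, 100 ms of buffer.
print(select_quality([0.5, 1.2, 2.5], throughput_mbps=40.0, buffer_s=0.1))  # -> 1
```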